A Fast and Scalable Hardware Architecture for K-means Clustering for Big Data Analysis

Author

Darshika G. Perera

Abstract

The exponential growth of complex, heterogeneous, dynamic, and unbounded data, generated by a variety of fields including health, genomics, physics, and climatology, poses significant challenges for data processing and the desired speed-performance. Existing processor-based (software-only) algorithms cannot analyze and process this enormous amount of data efficiently and effectively. Consequently, some kind of hardware support is desirable to overcome the challenges in analyzing big data. Our objective is to provide hardware support for big data analysis that satisfies the associated constraints and requirements. Big data analytics involves many important data mining tasks, including clustering, which categorizes data into meaningful groups based on the similarity or dissimilarity among objects. In this research work, we investigate and propose a customized hardware architecture for K-means clustering, one of the most popular clustering algorithms. Our hardware design executes multiple computations in parallel to significantly enhance the speed-performance of the algorithm by exploiting the inherent parallelism and pipelined nature of its operations. We design and develop our hardware architecture on a Field Programmable Gate Array (FPGA)-based development platform. Experiments are performed to evaluate the proposed hardware design against its software counterpart running on an embedded processor on the same development platform. Different hardware configurations (consisting of varying numbers of parallel processing elements) are evaluated on varying data sizes. Our hardware configuration consisting of 32 parallel processing elements (PEs) executes up to 150 times faster than the software-only solution running on the processor. We observe that the speed-performance increases further with the number of parallel PEs as well as with the size of the data.
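The division of work among parallel PEs described in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware design itself: the function names (`assign_chunk`, `kmeans_step`, `kmeans`), the interleaved split of the data, and the Euclidean distance metric are all hypothetical choices made for illustration, and the loop over chunks runs sequentially where the hardware PEs would operate concurrently.

```python
from math import dist  # Euclidean distance (Python 3.8+); the metric is an assumption


def assign_chunk(points, centroids):
    """Work of one hypothetical PE: label each point with its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
        for p in points
    ]


def kmeans_step(points, centroids, num_pes):
    """One K-means iteration with the data split across num_pes modeled PEs."""
    # Interleaved split: modeled PE i receives points i, i + num_pes, ...
    chunks = [points[i::num_pes] for i in range(num_pes)]
    # Sequential stand-in for PEs that would run concurrently in hardware.
    labels_per_pe = [assign_chunk(chunk, centroids) for chunk in chunks]

    # Reduce the per-PE results into per-cluster coordinate sums and counts.
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for chunk, labels in zip(chunks, labels_per_pe):
        for p, k in zip(chunk, labels):
            counts[k] += 1
            for d in range(dim):
                sums[k][d] += p[d]

    # New centroid = mean of assigned points; an empty cluster keeps its centroid.
    return [
        [s / counts[k] for s in sums[k]] if counts[k] else list(centroids[k])
        for k in range(len(centroids))
    ]


def kmeans(points, centroids, num_pes=4, max_iters=100):
    """Repeat kmeans_step until the centroids stop changing or max_iters is hit."""
    for _ in range(max_iters):
        updated = kmeans_step(points, centroids, num_pes)
        if updated == centroids:
            break
        centroids = updated
    return centroids
```

In this model the nearest-centroid search for each point is independent of every other point, which is the property that lets the assignment step be spread over more PEs; only the final accumulation of sums and counts has to combine results across them.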
These investigations demonstrate that hardware support for clustering algorithms is not only feasible but also crucial to meet the requirements and constraints associated with analyzing and processing big data. Our proposed hardware architecture is generic and parameterized. It is scalable to support larger and varying datasets as well as a varying number of clusters.

Dedicated to my family.

ACKNOWLEDGEMENTS

First and foremost, I would like to thank my advisor and mentor, Dr. Darshika G. Perera, for giving me the opportunity to conduct this research work. Her immense knowledge in the field of Data Mining was a guiding light on many occasions. I am also very grateful to her for taking the time to go through this thesis document thoroughly and provide invaluable feedback. I am also thankful to the thesis panel members, Dr. T.S. Kalkur and Dr. Charlie Wang, for setting aside time from their academic and professional commitments to review and evaluate my research work. Special thanks go to my fellow researchers, Navid, Mong and Anne, for providing excellent suggestions during early brainstorming sessions. I would also like to acknowledge the efforts of Ashley and Christina from the UCCS Writing Center, who proofread this document and helped me polish it. Most importantly, I feel very blessed to have the unconditional love and support of my family: my mother, Rajeshwari, my wife, Nisha, and my daughter, Rashi, who endured my absence from home, especially during weekends. Their constant support and motivation inspired me to complete this research.


Similar resources

Modification of the Fast Global K-means Using a Fuzzy Relation with Application in Microarray Data Analysis

Recognizing genes with distinctive expression levels can help in prevention, diagnosis and treatment of the diseases at the genomic level. In this paper, fast Global k-means (fast GKM) is developed for clustering the gene expression datasets. Fast GKM is a significant improvement of the k-means clustering method. It is an incremental clustering method which starts with one cluster. Iteratively ...


Fast Data Clustering and Outlier Detection Using K-means Clustering on Apache Spark

The components forming the information society nowadays are seen in all areas of our lives. As computers have a great deal of importance in our lives, the amount of information has begun to gather meaningful and specific qualities. Not only the amount of information is increased, but also the speed of access to information has increased. Large data is the transformed form of all data recovered ...


Improved COA with Chaotic Initialization and Intelligent Migration for Data Clustering

A well-known clustering algorithm is K-means. This algorithm, besides advantages such as high speed and ease of employment, suffers from the problem of local optima. In order to overcome this problem, a lot of studies have been done in clustering. This paper presents a hybrid Extended Cuckoo Optimization Algorithm (ECOA) and K-means (K), which is called ECOA-K. The COA algorithm has advantages ...


Adapting k-means for Clustering in Big Data

Big data if used properly can bring huge benefits to the business, science and humanity. The various properties of big data like volume, velocity, variety, variation and veracity render the existing techniques of data analysis ineffective. Big data analysis needs fusion of techniques for data mining with those of machine learning. The k-means algorithm is one such algorithm which has presence i...


Persistent K-Means: Stable Data Clustering Algorithm Based on K-Means Algorithm

Identifying clusters, or clustering, is an important aspect of data analysis. It is the task of grouping a set of objects in such a way that objects in the same group/cluster are more similar in some sense or another. It is a main task of exploratory data mining, and a common technique for statistical data analysis. This paper proposed an improved version of K-Means algorithm, namely Persistent K...


Publication date: 2016
